Abstract

The universal scalability law of computational capacity is a rational function C_p = P(p)/Q(p)
with P(p) a linear polynomial and Q(p) a second-degree polynomial in the number of physical
processors p, that has long been used for statistical modeling and prediction of computer system
performance. We prove that C_p is equivalent to the synchronous throughput bound for a
machine-repairman with state-dependent service rate. Simpler rational functions, such as Amdahl's
law and Gustafson speedup, are corollaries of this queue-theoretic bound. C_p is further shown to be
both necessary and sufficient for modeling all practical characteristics of computational scalability.

1 Introduction 

For several decades, a class of real functions called rational functions [1] has been used to represent
throughput scalability as a function of physical processor configuration. In particular, Amdahl's
law [2], its modification due to Gustafson [3] and the Universal Scalability Law (USL) [3] have
found ubiquitous application. In this context, the relative computing capacity, C_p, is a rational
function of the number of physical processors p. It is defined as the quotient of a polynomial P(p)
in the numerator and Q(p) in the denominator, i.e., C_p = P(p)/Q(p). Each of the above-mentioned
scalability models is distinguished by the number of coefficients or fitting parameters associated
with the polynomials P(p) and Q(p). For example, Amdahl's law and Gustafson's modification
are single-parameter models, whereas the USL model contains two parameters.

Despite their historical utility, these models have stood in isolation without any deeper physical 
interpretation. It has even been suggested that Amdahl's law is not fundamental [5]. More 
importantly, the lack of a unified physical interpretation has led to the use of certain flawed 
scalability models [6]. In this note, we demonstrate that the aforementioned class of rational 
functions corresponds to certain performance bounds belonging to a queue-theoretic model. 

The idea that Amdahl's law, which has most frequently been associated with the scalability
of massively parallel systems, can be considered from a queue-theoretic standpoint is not entirely
new [See, e.g., [7, 8]]. However, quite apart from having motivations entirely different from our own,
those previous works employed open queueing models with an unbounded number of requests (see
Appendix C), whereas we shall use a closed queueing model with a finite number of requests
p corresponding to the number of physical processors. The USL function is associated with a
state-dependent generalization of the machine repairman [5].

The organization of this paper is as follows. We briefly review the scalability models of interest
in Sect. 2. The appropriate queueing metrics associated with the standard machine repairman and
its state-dependent extension are discussed in Sect. 3. The performance characteristics associated
with synchronous queueing are also presented there. The main theorem (Theorem 2) is established
in Sect. 4. Amdahl's law and Gustafson's linear speedup are shown to be corollaries of this theorem.
Finally, in Sect. 5 we prove an earlier conjecture that a rational function with Q(p) a second-degree
polynomial is both necessary and sufficient to model all practical cases of computational scalability.

2 Parametric Models



Although technically, we are discussing rational functions, we shall hereafter refer to them as 
parametric models, and the coefficients as parameters, since the primary application of these 
models is nonlinear statistical regression of performance data [See e.g.. f^l [TU1 1111 and references 
therein] . 




Figure 1: Parametric models: USL (red), Amdahl (green), Gustafson (blue), with parameter values
exaggerated to distinguish their typical characteristics relative to ideal linear scaling (dashed). The
horizontal line is the Amdahl asymptote at σ^{-1}.



Definition 1 (Speedup). If an amount of work N is completed in time T_1 on a uniprocessor, the
same amount of work can be completed in time T_p < T_1 on a p-way multiprocessor. The speedup
S_p = T_1/T_p is one measure of scalability.



2.1 Amdahl's law 

For a single task that takes time T_1 to execute on a uniprocessor (p = 1), Amdahl's law [2]
states that if the task can be equipartitioned onto p processors, but contains an irreducible fraction
of sequential work σ ∈ [0, 1], then only the remaining portion of the execution time (1 − σ)T_1
can be executed as p parallel subtasks on p physical processors. The bound on the achievable
equipartitioned speedup [13] is given by the ratio

    S_p(\sigma) = \frac{T_1}{\sigma T_1 + (1 - \sigma)\, T_1 / p}    (1)

which simplifies to

    S_p(\sigma) = \frac{p}{1 + \sigma (p - 1)},    (2)

a rational function with P(p) = p and Q(p) a first-degree polynomial. As the processor
configuration is increased, i.e., p → ∞, the number of concomitantly smaller subtasks also increases and
the speedup approaches an asymptote,

    S_p(\sigma) \sim \sigma^{-1},    (3)

shown as the horizontal line in Fig. 1.
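
As a numerical illustration of (1), (2) and (3), the following minimal Python sketch evaluates the Amdahl speedup both from the execution times and from the simplified rational form. The function names and the sample value σ = 0.1 are illustrative assumptions and do not come from the original analysis.

```python
def amdahl_from_times(p, sigma, t1=1.0):
    """Equation (1): T_1 divided by serial time plus equipartitioned parallel time."""
    return t1 / (sigma * t1 + (1.0 - sigma) * t1 / p)


def amdahl_speedup(p, sigma):
    """Equation (2): the simplified rational form p / (1 + sigma * (p - 1))."""
    return p / (1.0 + sigma * (p - 1))


if __name__ == "__main__":
    sigma = 0.1  # hypothetical serial fraction, chosen only for illustration
    for p in (1, 2, 8, 64, 1024):
        # Both forms should agree; the values approach 1 / sigma as p grows.
        print(p, amdahl_from_times(p, sigma), amdahl_speedup(p, sigma))
```

For large p the printed values level off just below 1/σ, which is the asymptote (3).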
2.2 Gustafson's speedup

Amdahl's law assumes the size of the work is fixed. Gustafson's modification is based on the idea
of scaling up the size of the work to match p. This assumption results in the theoretical recovery
of linear speedup

    S_p^G(\sigma) = \sigma + (1 - \sigma)\, p    (4)

Equation (4) is a rational function with Q(p) = 1 and P(p) a first-degree polynomial in p.

Although (4) has inspired various efforts for improving parallel processing efficiencies, achieving
truly linear speedup has turned out to be difficult in practice. Most recently, (4) has been proposed
as a way to optimize the throughput of multicore processors [14].

Definition 2 (Relative Capacity). As an alternative to the speedup, scalability can also be defined
as the relative capacity, C_p = X(p)/X(1), where X(p) is the throughput with p processors, and
X(1) the throughput of a single processor.



2.3 Universal Scalability Law (USL) 

The USL model [H I1UI ITT] is a rational function with P(p) = p and Q(p) a second-degree
polynomial:

    C_p(\sigma, \kappa) = \frac{p}{1 + \sigma (p - 1) + \kappa\, p (p - 1)}    (5)

where the coefficients belonging to the terms in the denominator have been regrouped into three
terms with two parameters (σ, κ). These terms can be interpreted as representing:

1. Ideal concurrency associated with linear scalability (σ, κ = 0)

2. Contention-limited scalability due to serialization or queueing (σ > 0, κ = 0)

3. Coherency-limited scalability due to inconsistent copies of data (σ, κ > 0)

Table 1 summarizes how these parameter values can be used to classify the scalability of different
types of applications.

Equation (5) subsumes both (2) and (4). In particular, (2) is identical to (5) with κ = 0. The key
distinction is that, unlike (2), (5) possesses a maximum at



    p^* = \sqrt{\frac{1 - \sigma}{\kappa}}    (6)

which is controlled by the parameter values according to:

(a) p* → 0 as κ → ∞

(b) p* → ∞ as κ → 0

(c) p* → κ^{-1/2} as σ → 0

(d) p* → 0 as σ → 1


The important implication is that beyond p* the throughput becomes retrograde. See Fig. 1. This
effect is commonly observed in applications that involve shared-writable data (Case D in Table 1).
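
The retrograde behavior beyond p* can be checked numerically. The sketch below evaluates (5) and the continuous maximum location (6); the function names and the parameter values σ = 0.05, κ = 0.001 are hypothetical and chosen only for illustration.

```python
import math


def usl_capacity(p, sigma, kappa):
    """Equation (5): relative capacity of the USL model."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


def usl_peak(sigma, kappa):
    """Equation (6): continuous location p* of the capacity maximum."""
    if kappa == 0:
        return math.inf  # no maximum: (5) reduces to Amdahl's law (2)
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    sigma, kappa = 0.05, 0.001  # hypothetical parameter values
    p_star = usl_peak(sigma, kappa)
    print("p* =", p_star)
    for p in (1, 16, round(p_star), 64, 128):
        # Capacity rises up to about p* and falls (retrograde) beyond it.
        print(p, usl_capacity(p, sigma, kappa))
```

Setting kappa to zero removes the maximum entirely, which corresponds to the contention-limited case.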

In the subsequent sections, we attempt to provide deeper insight into the physical significance
of (5) by recognizing its association with queueing theory.



3 Queueing Models 

The machine repairman (Fig. 2) is a well-known queueing model [15] which represents an assembly
line comprising a finite number of machines p which break down after a mean lifetime Z. A
repairman takes a mean time S to repair a broken machine and, if multiple machines fail, the additional
machines must queue for service in FIFO order. The queue-theoretic notation, M/M/1//p, implies
exponentially distributed lifetimes and service periods with a finite population p of requests and
buffering.
Table 1: Application domains for the USL model

A: Ideal concurrency (σ, κ = 0)            B: Contention-limited (σ > 0, κ = 0)
   Single-threaded tasks                      Tasks requiring locking or sequencing
   Parallel text search                       Message-passing protocols
   Read-only queries                          Polling protocols (e.g., hypervisors)

C: Coherency-limited (σ = 0, κ > 0)        D: Worst case (σ, κ > 0)
   SMP cache pinging                          Tasks acting on shared-writable data
   Incoherent application state between       Online reservation systems
   cluster nodes                              Updating database records



Figure 2: Conventional machine repairman queueing model comprising p machines with mean uptime 
Z (top) and a repair queue (bottom) with mean service time S. 

3.1 Queueing Metrics 

The performance characteristics of interest for the subsequent discussion are the throughput X(p)
and residence time R(p).

Definition 3 (Throughput). The throughput, X = N/T, is the number of tasks N completed in
time T.

Definition 4 (Residence Time). The mean residence time R = W + S is the sum of the time spent
waiting for service W plus the actual repair time S when the repairman services the machine.

On average, the number of machines that are "up" is ZX, while Q are "down" (for repairs),
such that the total number of machines in either state is given by p = Q + ZX. Rearranging this
expression produces

    Q = p - ZX

and applying Little's law (Q = XR) [15]

    XR = p - ZX

gives

    R(p) = \frac{p}{X(p)} - Z    (7)

for the mean residence time at the repair station. Rearranging (7) provides an expression for the
mean throughput of the repairman as a function of p:

    X(p) = \frac{p}{R(p) + Z}    (8)

Definition 5 (Mean RTT). The denominator in (8), R(p) + Z, is the mean round-trip time (RTT)
for M/M/1//p.
Table 2: Interpretation of the queueing metrics in Fig. 2

Metric   Repairman        Multiprocessor         Time share
p        machines         processors             users
Z        up time          execution period       think time
S        service time     transmission time      CPU time
R(p)     residence time   interconnect latency   run-queue time
X(p)     failure rate     bandwidth              throughput


Since M/M/1//p is an abstraction, it can be applied to different computational contexts.
In the subsequent sections, the p machines will be taken to represent physical processors and
the time spent at the repair station is taken to represent the interconnect latency between the
processors [16, 17]. See Table 2. This choice is merely to conform to the conventions most
commonly used in discussions of parallel scalability [11], but the generic nature of the queueing model
means that any conclusions also hold for software scalability [See, e.g., [10], Chap. 6].

3.2 Synchronous Queueing 

We consider the special case of synchronous queueing in M/M/1//p. The queue-theoretic
performance metrics defined in Sect. 3.1 are for the steady-state case and therefore each corresponds to
the statistical mean of the respective random variable. Moreover, as already mentioned, we cannot
give an explicit expression for R(p) since its value is dependent on the value of X(p), which is also
unknown in steady state.

Definition 6 (Synchronized Requests). If all the machines in M/M/1//p break down
simultaneously, the queue length at the repairman is maximized such that the residence time in definition 4
becomes R(p) = pS. This situation corresponds to one machine in service and (p − 1) waiting for
service and provides a lower bound on (8) [T5] Q32 |2U] :

    \frac{p}{pS + Z} \le X(p)    (9)

In the context of multiprocessor scalability (Table 2), it is tantamount to all p processors
simultaneously issuing requests across the interconnect.
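
To see the bound (9) at work, the sketch below compares it with the steady-state throughput of M/M/1//p. The steady-state value is computed with the standard mean value analysis (MVA) recursion, which is a technique swapped in here for illustration and is not used in the text; the function names and the values S = 1, Z = 9 are likewise assumptions.

```python
def repairman_throughput(p, S, Z):
    """Mean throughput X(p) of M/M/1//p computed by the MVA recursion."""
    q = 0.0  # mean number of machines at the repairman with n - 1 machines
    x = 0.0
    for n in range(1, p + 1):
        r = S * (1.0 + q)  # residence time seen by an arriving machine
        x = n / (r + Z)    # equation (8)
        q = x * r          # Little's law, Q = XR
    return x


def synchronous_bound(p, S, Z):
    """Left-hand side of (9): throughput when R(p) = pS."""
    return p / (p * S + Z)


if __name__ == "__main__":
    S, Z = 1.0, 9.0  # hypothetical mean service time and mean uptime
    for p in (1, 2, 4, 8, 16, 32):
        print(p, synchronous_bound(p, S, Z), repairman_throughput(p, S, Z))
```

The two columns coincide at p = 1 and the synchronous value stays at or below the steady-state value for larger p, as expected of a lower bound.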

Remark 1 (A Paradox). Consider the case where all p processors have the same deterministic Z
period. At the end of the first Z period, all p requests will enqueue at the interconnect (lower
portion of Fig. 2) simultaneously. By definition, however, the requests are serviced serially, so they
will return to the parallel execution phase (top portion of Fig. 2) separately and thereafter will
always return to the interconnect at different times. In other words, even if the queueing system
starts with synchronized visits to the interconnect, that synchronization is immediately lost after
the first tour because it is destroyed by the serial queueing process. The resolution of this paradox
is discussed in Appendix B.

Definition 7 (Synchronous RTT). In the presence of synchronous queueing, the mean RTT of
definition 5 becomes pS + Z.

Lemma 1. The relative capacity C_p and the speedup S_p give identical values for the same processor
configuration p.

Proof. Let the uniprocessor throughput be defined as X_1 = N/T_1 and the multiprocessor
throughput as X_p = N/T_p. Hence

    C_p = \frac{X(p)}{X(1)} = \frac{N}{T_p} \cdot \frac{T_1}{N} = S_p

follows from definition 1. □
Lemma 2 (Serial Fraction). The serial fraction (Sect. 2.1) can be expressed in terms of M/M/1//p
metrics by the identity

    \sigma = \frac{S}{S + Z} \to
    \begin{cases}
      0 & \text{as } S \to 0,\ Z = \text{const.}, \\
      1 & \text{as } Z \to 0,\ S = \text{const.}
    \end{cases}    (10)

If σ = 0, then there is no communication between processors and the interconnect latency
therefore vanishes (maximal execution time). Conversely, if σ = 1, then the execution time vanishes
(maximal communication latency).

Proof. The RTT for a single (unpartitioned) task in Fig. 2 is T_1 = S + Z. The RTT for p
equipartitioned subtasks is T_p = p(S/p) + Z/p. From definition 1 the corresponding speedup is

    S_p = \frac{T_1}{T_p} = \frac{S + Z}{S + Z/p}    (11)

Equating (11) with (1), we find

    S = \sigma T_1 \quad \text{and} \quad Z = (1 - \sigma)\, T_1    (12)

Eliminating T_1 produces

    \frac{Z}{S} = \frac{1 - \sigma}{\sigma}    (13)

which, upon solving for σ, produces (10). □

See Appendix B for another perspective.

Definition 8. The quantity

    \frac{Z}{S}    (14)

is the service ratio for the M/M/1//p model.

Theorem 1 (Speedup Duality). Let (σ, π) be a continuous dual-parameter pair, where σ is the serial
fraction (10) and π = 1 − σ. The Amdahl speedup (2) is invariant under scalings of (σ, π) by p.



Proof. Using definition 8, theorem 1 can be represented diagrammatically as:

                 π/σ = Z/S
                /         \
        (Z/p)/S             Z/(pS)          (15)
                \         /
                  S_p(σ)

The path on the left-hand side of (15) corresponds to reducing the single task execution time by p
(subtasks) while the interconnect service time remains unchanged. This follows from definition 6,
R = pS, but the service time for each subtask is also reduced to S/p. Hence, R = S. Conversely,
the right-hand path of (15) corresponds to p tasks, each with unchanged execution time Z, but
scaled service time R = pS. Both paths result in Amdahl's law (2), which can be seen by first
rewriting (11) in terms of the service ratio Z/S (definition 8):

    S_p = \frac{1 + Z/S}{1 + \frac{1}{p}\left(\frac{Z}{S}\right)}    (16)

(a) Interpreting the denominator in (16) as belonging to the left-hand path of (15) leads to the
expansion

    S_p = \frac{S + Z}{\frac{S + Z}{p} + S - \frac{S}{p}}
        = \frac{(S + Z)/S}{\frac{1}{p}\left(\frac{S + Z}{S}\right) + \frac{p - 1}{p}}

Collecting terms and simplifying produces:

    S_p = \frac{p}{1 + \left(\frac{S}{S + Z}\right)(p - 1)}    (17)

which is identical to (2) upon substituting (10).

(b) Following the right-hand path in (15) leads to the expansion

    S_p = \frac{pS(1 + Z/S)}{pS + Z}
        = \frac{p}{\frac{pS}{S + Z} + \frac{Z}{S + Z} + \frac{S}{S + Z} - \frac{S}{S + Z}}

Collecting terms in the denominator produces:

    S_p = \frac{p}{\left(\frac{S}{S + Z}\right)(p - 1) + 1}    (18)

which also yields (2) via (10). □



Remark 2. Theorem 1 anticipates the interpretation of Gustafson's law as a consequence of scaling
the work size Z ↦ pZ in M/M/1//p. See corollary 2.



3.3 State-Dependent Service 

We now consider a generalization of this machine repairman model in which the residence time
R(p) includes an additional time that is proportional to the load on the server, expressed as the
number of enqueued requests. Since the queue length is a canonical measure of the state of the
system, the repairman becomes a state-dependent server [9, 15], denoted M/G/1//p. Let the
additional service time be S' in the state-dependent progression:

    p = 1:  R(1) = 1 S
    p = 2:  R(2) = 2 (S + S')
    p = 3:  R(3) = 3 (S + 2 S')
    p = 4:  R(4) = 4 (S + 3 S')                (19)
    ...
    p = p:  R(p) = p (S + (p - 1) S')

The extra time spent by each machine at the repair station increases linearly with the additional
number of "down" machines. There is no stretching of the mean service time, S, when repairing a
single machine.

Remark 3. In general, it is expected that S' < S. It could, however, be a multiple of S, but that
is clearly undesirable.

Some example applications of the state-dependent M/G/1//p model in a computational context
include:

(a) Pairwise Exchange: Modeling the performance degradation due to combinatoric pairwise
exchange of data between p multiprocessor caches or cluster nodes. See Sect. 5.

(b) Broadcast Protocol: If any processor broadcasts a request for data, the other (p − 1) processors
must stop and respond before computation can continue [11].

(c) Virtual Memory: Each task is a program with its own working set of memory pages. Page
replacement relies on a higher latency device, such as a disk. As the number of programs p
increases, page replacement latency causes the system to "thrash" such that the throughput
becomes retrograde [9]. Cf. Fig. 2.

4 Parametric Models as Queueing Bounds

In this section, we show that the parametric scalability models in Section 2 correspond to certain
throughput bounds on the queueing models in Section 3.

Theorem 2 (Main Result). The universal scalability law (5) is equivalent to the synchronous relative
throughput in M/G/1//p.



Proof. Let S' = cS in (19), with c a positive constant of proportionality. The residence time for
state-dependent, synchronous requests becomes

    R(p) = pS + c\,p(p - 1)\,S    (20)

Substituting (20) into definition 2, with X(p) given by (8) and X(1) = 1/(S + Z),

    C_p = \frac{p(S + Z)}{pS + c\,p(p - 1)\,S + Z}    (21)
        = \frac{p(S + Z)}{(p - 1)S + (S + Z) + c\,p(p - 1)\,S}
        = \frac{p(S + Z)}{(S + Z)\left[1 + (p - 1)S(S + Z)^{-1} + c\,p(p - 1)S(S + Z)^{-1}\right]}

Collecting terms and simplifying produces:

    C_p = \frac{p}{1 + \sigma(p - 1) + c\,\sigma\,p(p - 1)},    (22)

where we have applied the identity for the serial fraction in lemma 2. Combining the coefficients
of the third term in the denominator of (22) as κ = cσ yields (5). □
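
The algebra in this proof can be spot-checked numerically. The sketch below builds the synchronous relative throughput directly from (20) and (8) and compares it with (5) under the substitutions σ = S/(S + Z) and κ = cσ. The function names and the values of S, Z and c are hypothetical.

```python
def synchronous_relative_throughput(p, S, Z, c):
    """Relative capacity built from the queueing metrics, as in (21)."""
    r_p = p * S + c * p * (p - 1) * S  # equation (20)
    x_p = p / (r_p + Z)                # equation (8) with R(p) from (20)
    x_1 = 1.0 / (S + Z)                # single-processor throughput
    return x_p / x_1                   # definition 2


def usl_capacity(p, sigma, kappa):
    """Equation (5)."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


if __name__ == "__main__":
    S, Z, c = 2.0, 18.0, 0.05          # hypothetical queueing parameters
    sigma = S / (S + Z)                # lemma 2, equation (10)
    kappa = c * sigma                  # identification used in the proof
    for p in (1, 2, 8, 32, 128):
        print(p, synchronous_relative_throughput(p, S, Z, c),
              usl_capacity(p, sigma, kappa))
```

Setting c = 0 in the same sketch gives the Amdahl case.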

Remark 4. Since c is an arbitrary constant, c > 0 implies that the parameter κ = cσ in (5) can
be unbounded, whereas σ < 1 always.

Remark 5. The state-dependence of R(p) in (20) does not change lemma 2, since σ is determined
by S and Z only and both of those queueing metrics are constants.

Corollary 1 (Amdahl's law). Amdahl's law is the synchronous bound on relative throughput
in M/M/1//p.

Proof. Follows immediately from the proof of theorem 2 with c = 0 in (21). □

Remark 6. Elsewhere [6, 10, 11], corollary 1 was proven as a separate theorem.

Corollary 2 (Gustafson's law). Gustafson's law corresponds to the rescaling Z ↦ pZ in
M/M/1//p.



Proof. Since Gustafson's result is a modification of Amdahl's law, we start with (21) and let c = 0.
Under Z ↦ pZ, the scalability function becomes:

    C_p = \frac{p(S + pZ)}{pS + pZ}
        = \frac{S + pZ}{S + Z}
        = \frac{S + p(Z + S - S)}{S + Z}

Once again, after application of lemma 2, this simplifies to

    C_p = \sigma + p - \sigma p    (23)

which is identical to the linear speedup S_p^G in (4). □
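
The rescaling argument can also be checked numerically. In this sketch, which assumes hypothetical values S = 1 and Z = 4 and illustrative function names, the first expression in the proof is evaluated directly and compared with (4).

```python
def rescaled_capacity(p, S, Z):
    """First line of the proof: (21) with c = 0 and Z replaced by pZ."""
    return p * (S + p * Z) / (p * S + p * Z)


def gustafson_speedup(p, sigma):
    """Equation (4): sigma + (1 - sigma) * p."""
    return sigma + (1.0 - sigma) * p


if __name__ == "__main__":
    S, Z = 1.0, 4.0        # hypothetical service time and execution time
    sigma = S / (S + Z)    # lemma 2, equation (10)
    for p in (1, 2, 10, 100):
        print(p, rescaled_capacity(p, S, Z), gustafson_speedup(p, sigma))
```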



Remark 7. Rewriting (23) as C'_p = C_p/(1 − σ) = p + σ/(1 − σ), we note that the additive constant
σ/(1 − σ) = S/Z is the inverse of the service ratio in definition 8. In the context of M/M/1//p,
rescaling the execution time, Z ↦ pZ, prior to partitioning, adds a fixed overhead (pS/pZ) to an
otherwise linear function of p, whereas the overhead in Amdahl's law is an increasing function of p.

The results of this section have also been confirmed using event-based simulations [12].

5 Extrema and Universality



Ideal linear scalability (C_p ~ p, for p > 0) has a positive and constant derivative. More realistically,
large processor configurations (p → ∞) are expected to approach saturation, i.e., the asymptote

    C_{sat} \sim \sigma^{-1}.

Any physical computational system that develops a scalability maximum at p* in the positive
quadrant (p > 0) must have a C_p with a negative derivative for p > p*, and therefore the relative
computational capacity falls below the saturation value, C_p < C_sat, i.e., throughput performance
becomes retrograde. Since this behavior is undesirable, there is little virtue in characterizing the
maximum beyond the ability to quantify its location (p*) using a given scalability model. This
observation leads to the following conjecture [See also [10], p. 65], which we now prove.

Conjecture 1 (Universality). For a rational function C_p = P(p)/Q(p) with P(p) = p, a necessary
and sufficient condition for C_p to be a model of computational scalability is Q(p) = 1 + a_1 p + a_2 p^2
with coefficients a_1, a_2 > 0.

The simplest line of proof comes from considering latency rather than throughput. See
Appendix D for further discussion.

Proof. Ideal latency reduction, T_p = T_1/p, is a hyperbolic function and therefore has no extrema.
The additional latency due to pairwise interprocessor communication introduces the combinatoric
term, \binom{p}{2} = p(p − 1)/2, such that the total latency becomes

    T_p(\kappa) = \frac{T_1}{p} + \kappa\, \frac{T_1}{2}\,(p - 1)    (24)

with constant κ > 0. Equation (24) has a unique minimum for p > 0 (Fig. 3). Substituting (24)
into the speedup definition 1 produces

    S_p(\kappa) = \frac{p}{1 + \kappa\, p(p - 1)}    (25)

where we have absorbed the factor of 2 in κ. S_p(κ) now possesses a unique maximum in the
positive quadrant (p > 0). Thus, the quadratic term in the denominator of (25) is necessary for
the existence of a maximum but it is not sufficient because S_p(κ) does not exhibit the Amdahl
asymptote σ^{-1} when κ = 0. However, the two-parameter latency

    T_p(\sigma, \kappa) = \frac{T_1}{p} + \sigma T_1 \left(\frac{p - 1}{p}\right) + \kappa\, \frac{T_1}{2}\,(p - 1)    (26)

does introduce the required term into (25) and, by lemma 1, is identical to (5). □
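
The latency argument can be illustrated with a short sketch. It evaluates (26), inverts it through definition 1, and compares the result with (5) after absorbing the factor of 2 into κ. The closed-form minimizer is obtained here by setting the derivative of (26) to zero, which is a step not written out in the text; the function names and parameter values are hypothetical.

```python
import math


def latency(p, sigma, kappa, t1=1.0):
    """Equation (26): ideal, broadcast and pairwise-exchange latency terms."""
    return t1 / p + sigma * t1 * (p - 1) / p + kappa * (t1 / 2.0) * (p - 1)


def usl_capacity(p, sigma, kappa):
    """Equation (5)."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


def latency_minimum(sigma, kappa):
    """Continuous minimizer of (26), from dT/dp = 0 (derived for this sketch)."""
    return math.sqrt(2.0 * (1.0 - sigma) / kappa)


if __name__ == "__main__":
    sigma, kappa = 0.02, 0.001  # hypothetical parameter values
    print("latency minimum near p =", latency_minimum(sigma, kappa))
    for p in (1, 8, 32, 44, 64, 128):
        speedup = 1.0 / latency(p, sigma, kappa)  # definition 1 with T_1 = 1
        print(p, speedup, usl_capacity(p, sigma, kappa / 2.0))
```

The speedup column peaks where the latency is smallest, which is the same location that (6) gives when κ is replaced by κ/2.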



The second term in (26) can be interpreted as the fixed time it takes any one processor to
broadcast a request for data and wait for the remaining fraction of processors, (p − 1)/p, to
respond simultaneously. This is also another way to view synchronous queueing in M/M/1//p. It
simply introduces a lower bound, σT_1, on the latency reduction.

The third term in (26) is analogous to Brooks's law [21]: "Adding more manpower to a late
software project makes it later," with p interpreted as people rather than processors. Here, for
example, it can be interpreted as the latency due to the pairwise exchange of data to maintain
cache coherency in a multiprocessor.



6 Conclusion 

Several ubiquitous scalability models, viz., Amdahl's law, Gustafson's law and the universal
scalability law (USL), belong to a class of rational functions. Treated as parametric models, they are
neither ad hoc nor unphysical. Rather, they correspond to certain bounds on the relative
throughput of the machine repairman queueing model. In the most general case, the main theorem 2
states that the USL model corresponds to the synchronous throughput bound of a load-dependent
machine repairman. USL subsumes both Amdahl's law and Gustafson's law as corollaries of
theorem 2. As well as providing a more physical basis for these scalability models, the queue-theoretic
interpretation has practical significance in that it facilitates prediction of response time scalability
using (7) and provides deeper insight into potential performance tuning opportunities.




Figure 3: A minimum occurs in the total latency T(p) due to an increasing pairwise-exchange time 
being added to the initial latency reduction. 



Appendices 

A Why Universal? 



The term "universal" is intended to convey the notion that the USL (defined in Sect. 2.3) can be
applied to any computer architecture, from multi-core to multi-tier. This follows from the fact that
there is nothing in (5) that explicitly represents any particular system architecture or interconnect
topology. That information is present but it is encoded in the numeric value of the parameters σ
and κ. The same could be said for Amdahl's law but the difference is that, being a rational function
with linear Q(p), Amdahl's law cannot predict the retrograde scalability commonly observed in
performance evaluation measurements [5, 10]. As proven in Sect. 5, the USL is both necessary and
sufficient to model all these effects.

The USL does not exclude defining a more general or more complex scalability model to account
for such details as heterogeneous processors or the functional form of degradation beyond p*, but
any such model must include the USL as a limiting case. The best analogy might be to regard
the USL as being akin to Newton's universal law of gravitation. Here, "universal" means generally
applicable to any gravitating bodies. Newton's theory has been superseded by a more sophisticated
theory of gravitation: Einstein's general theory of relativity. Einstein's theory, however, does not
negate Newton's theory but rather contains it as a limiting case, when space-time is flat. Since
space-time is very nearly flat in such practical applications, NASA uses Newton's equations to
calculate the flight paths of its missions.



B Synchronous Queueing 

The proofs of theorem 2 (USL) and corollary 1 (Amdahl) employ mean value equations for metrics
which characterize steady-state conditions. As noted in remark 1, synchronized queueing cannot
be maintained in steady state. Synchronization and steady state are not compatible concepts
because the former is an instantaneous effect, whereas the analytic solutions we seek are only valid
in long-run equilibrium.

Elsewhere [10, 12], we have shown that a necessary requirement for maintaining synchronous
queueing is to introduce another buffer in addition to the waiting line (W) at the repairman. If
the extra buffer represents a post-repair collection point, such that each repaired machine
(completed request) is held "off-line" until all p machines are repaired, then synchronous queueing is
maintained provided the Z periods are i.i.d. deterministic. The extra buffer acts as a barrier
synchronizer. Unfortunately, this is a D/M/1//p queue, whereas the proof of corollary 1 is based on
M/M/1//p. However, the repairman performance metrics are robust [22], so our results should
also hold for G/M/1//p.

In more recent simulation experiments [H] , we have shown that this restriction on Z periods can 
be lifted by positioning the buffer as a pre-repair waiting room. Instead of requiring all p machines 
to break down and enqueue simultaneously, we allow any number, less than p, to fail but with 
the added constraint that when any machine invokes service at the repairman, all other machines 
(or executing processors) must suspend operations as well, i.e., visit the suspension buffer. Under 
these conditions, Z can be G-distributed. Because this intermittent synchronization occurs with 
much higher frequency and for much shorter average time periods than barrier synchronization, 
the potential impact of the G-distributed tails on the Z periods is truncated. 

Synchronization can be treated as a two-state Markov process, e.g., A: parallel and B: serial,
where the B state includes those processes that are suspended as well as waiting for service. If λ_A
is the transition rate for A → B and λ_B for B → A, then the probability of being in state B is
given by

    \Pr(B) = \frac{\lambda_A}{\lambda_A + \lambda_B}    (27)

In the previous scenario it only takes a single machine to fail to suspend all other machines. The
failure rate is therefore λ_A = 1/Z and the service rate is λ_B = 1/S. Substituting these into (27)
produces an expression identical to the serial fraction σ in (10). In state B, some fraction p_1 are
enqueued and the remainder p_2 = p − p_1 are suspended. On average, any machine can expect to
spend time R = (p_1 + p_2)S to complete repairs. Hence, the total serial time is R = pS, which is
the quantity that appears in the proofs.
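
The identification of (27) with the serial fraction is a one-line substitution, sketched below for concreteness. The function names and the values S = 2, Z = 18 are assumptions made for the example.

```python
def prob_serial(rate_ab, rate_ba):
    """Equation (27): stationary probability of state B in a two-state chain."""
    return rate_ab / (rate_ab + rate_ba)


def serial_fraction(S, Z):
    """Equation (10): sigma = S / (S + Z)."""
    return S / (S + Z)


if __name__ == "__main__":
    S, Z = 2.0, 18.0  # hypothetical mean service time and mean uptime
    # Failure rate 1/Z drives A -> B; service rate 1/S drives B -> A.
    print(prob_serial(1.0 / Z, 1.0 / S), serial_fraction(S, Z))
```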



C Queueing Models of Amdahl's Law 

Others have also considered Amdahl's law from a queue-theoretic standpoint [See, e.g., [7, 8]]. Of
these, [8] is closest to our discussion, so we briefly summarize the differences.

First, the motivations are quite different. The author of [8], like many other authors, seeks
clever ways to defeat Amdahl's law in the sense of Gustafson (Sect. 2.2), whereas we are trying
to understand Amdahl's law by providing it with a more fundamental physical interpretation.
Ironically, both investigations invoke queue-theoretic models to gain more insight into the pertinent
issues: an open (M/M/m) queue in [8], a closed queue (Fig. 2) in this paper.

Second, two steps are undertaken in [8] to define an alternative speedup function:

1. An attempt to unify both the Amdahl and Gustafson equations into a single speedup function.

2. An extension of that unified speedup function to include waiting times.

The overarching goal is to find waiting-time optima for this unified speedup function. The unifica- 
tion step is achieved by purely algebraic manipulations and does not rest on any queue-theoretic 
arguments. The open queueing model provides an ad hoc means for incorporating waiting times 
as a function of queue length. The subsequent analysis is based entirely on simulation results and 
thereafter departs significantly from the analytic approach of this paper. 

By virtue of our approach, we have shown that both Amdahl and Gustafson scaling laws are
unified by the same queueing model, viz., the machine-repairman model. Moreover, corollary 1
is a lower bound on throughput, namely the synchronous throughput, and therefore represents
worst-case scalability. With this physical interpretation, it follows immediately that Amdahl's law can be
"defeated" more conveniently than proposed in [8] by simply requiring that all requests be issued
asynchronously [12].

D Remarks on the Proof of Conjecture 1

The proof in Section 5 is simplified by using the additive properties of the latency function T_p
rather than inverting the rational function f(p) = P(p)/Q(p) directly. Since Q(p) is quadratic
in p, the temptation is to consider the inverse f^{-1} and use the fact that a general quadratic
function has a minimum.

To see why this approach runs into difficulties, consider the simplified representation of (5)

    f(p) = \frac{p}{1 + p + p^2}    (28)

with all unit coefficients. Equation (28) is not invertible in the formal sense because f^{-1} is not
a one-to-one mapping, even in the positive quadrant. We can, however, consider the full inverse
with its branches, as shown in Fig. 4. The principal branch is shown in light blue with the other
branches occurring at the extrema of f. This corresponds to the extrema of (28) occurring at
p = ±1, where f(1) = 1/3 and f(−1) = −1. Hence, f^{-1} ∈ ℂ for all arguments greater than 1/3.





Figure 4: The complete rational function (28) (red) and its inverse (blue) 



Alternatively, choosing the denominator to be a perfect square:

    f(p) = \frac{p}{1 + 2p + p^2}    (29)

(29) can be expressed either as a product of linear factors:

    \frac{p}{1 + 2p + p^2} = \left(\frac{p}{1 + p}\right)\left(\frac{1}{1 + p}\right)

or a partial fraction expansion:

    \frac{p}{1 + 2p + p^2} = \frac{1}{1 + p} - \frac{1}{(1 + p)^2}



Although such decompositions are suggestive of the need for two parameters (Fig. 5), they
would seem to obscure the proof of conjecture 1 rather than illuminate it. Using the latency function
T_p and then "inverting" it to produce the corresponding throughput scaling using lemma 1 avoids
these problems.
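
The properties of (28) and (29) discussed in this appendix can be checked with a few lines of Python. The sketch below, whose function names are illustrative assumptions, evaluates the derivative of (28) at p = ±1, shows two distinct positive arguments that map to the same value of f, and compares (29) with its partial fraction expansion.

```python
def f28(p):
    """Equation (28): p / (1 + p + p^2)."""
    return p / (1.0 + p + p * p)


def f28_derivative(p):
    """Derivative of (28) by the quotient rule: (1 - p^2) / (1 + p + p^2)^2."""
    return (1.0 - p * p) / (1.0 + p + p * p) ** 2


def f29(p):
    """Equation (29): p / (1 + 2p + p^2)."""
    return p / (1.0 + 2.0 * p + p * p)


def f29_partial_fractions(p):
    """Partial fraction form of (29): 1/(1 + p) - 1/(1 + p)^2."""
    return 1.0 / (1.0 + p) - 1.0 / (1.0 + p) ** 2


if __name__ == "__main__":
    # The derivative of (28) vanishes at p = +1 and p = -1.
    print(f28_derivative(1.0), f28_derivative(-1.0), f28(1.0), f28(-1.0))
    # Two different positive arguments give the same value, so (28) is not one-to-one.
    print(f28(0.5), f28(2.0))
    for p in (0.5, 1.0, 2.0, 10.0):
        print(p, f29(p), f29_partial_fractions(p))
```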
Figure 5: Amdahl scaling (red), envelope function (1 + p)^{-1} (green), their convolution (solid blue) and
equation (28) (dashed blue). The small difference in the latter two curves arises from the factor of 2 in
the denominator of equation (29).